Introduction to Open Data Science - Course Project

About Janina

A short description about the course and a link to my GitHub repository.

# "R chunk" for R code.

date()
## [1] "Thu Dec 09 16:07:00 2021"

How are you feeling right now? What do you expect to learn? Where did you hear about the course?

Hello!

My name is Janina and I am currently working at INAR (Atmospheric and Earth System Research) in Kumpula. I have been studying meteorology and I finished my master’s studies a couple of days ago! Now I am applying for doctoral studies on the topic of climate education. As I am moving towards education and educational research, I need R and statistical skills. One of my colleagues from the Faculty of Educational Sciences suggested this course.

I am feeling quite excited. At first I was stressed about getting back to studying, but I am sure everything will go well.


My Repository

Have a lovely day,

Janina


Chapter 2 - Regression and model validation


Describe the work you have done this week and summarize your learning.


Overview of the data


Let us take a closer look at the data, which results from the data wrangling: it was collected in 2014/15 in Finland in a survey about students’ learning during a statistics course.

date()
## [1] "Thu Dec 09 16:07:00 2021"
#Reading the data
learning2014 = read.table(file="learning2014.txt", sep=",", header=TRUE)

#Structure and dimensions of the data
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166   7


The structure of the data looks good. There are 166 observations of 7 variables. The variables are gender, age, attitude, deep (learning), stra(tegic approach), surf(ace approach) and points.

‘Deep’, ‘stra’ and ‘surf’ combine 8 subscales in total, each consisting of 4 questions:

  • Deep learning: seeking meaning, relating ideas, use of evidence
  • Surface approach: lack of purpose, unrelated memorizing, syllabus-boundness
  • Strategic approach: organized studying, time management

‘Attitude’ includes 10 questions:

  • I feel that I know statistics
  • Statistics will be useful in working life
  • I am interested in understanding statistics
  • I did well in mathematics at school
  • I am interested in learning statistics
  • I feel insecure with statistics exercises
  • I like statistics
  • Statistics frightens me
  • I trust my mathematical skills
  • Statistics will be useful in my future studies

The values have been scaled by taking the mean, and all observations with points = 0 have been excluded.
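As a minimal sketch (with hypothetical toy columns q1 and q2 standing in for one subscale, not the real question names), the scaling and filtering described above could look like this:

```r
# Toy illustration of the wrangling steps described above:
# combine question columns by their row mean, then drop zero-point rows
lrn14 <- data.frame(q1 = c(4, 2, 5), q2 = c(2, 4, 3), points = c(20, 0, 25))
lrn14$deep <- rowMeans(lrn14[, c("q1", "q2")])   # scale by taking the mean
learning2014_demo <- subset(lrn14, points > 0)   # exclude points == 0
nrow(learning2014_demo)  # 2
```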

Graphical overview of the data - Matrix plot


library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# Creating a plot matrix with ggpairs(), colouring has been set by gender (M=blue, F=red)
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

# Drawing the plot
p



So here we have a large plot matrix of all 7 variables. Red indicates female respondents and blue male respondents. It is worth noticing that there are nearly twice as many female respondents as male ones, and that the majority of the students are 20 to 30 years old.

In the leftmost column we have the distributions of the observations presented horizontally. The same distributions are also presented in a more visual way along the diagonal. The distributions of age and points are relatively similar between female and male respondents. The clearest differences can be found in the distributions of attitude (female attitudes are spread more evenly between 1 and 4, while males have a higher peak at 3-4) and surface approach (females have a higher peak around 3). This could indicate that male respondents had a more positive attitude and stronger motivation than female respondents.

Below the diagonal we have scatter plots between pairs of variables, and above the diagonal the corresponding correlations as numbers. The correlation coefficient, Corr, refers to the Pearson correlation and varies between -1 and +1. The closer its absolute value is to 1, the stronger the dependence between the two variables. When the correlation is negative, the dependence is inverse: the value of one variable (y) decreases when the value of the other variable (x) increases.

The highest correlation is between exam points and attitude (0.437). The next highest correlation is negative, between surface approach and deep learning (-0.324). As the subscales of surface approach are mostly negative ones (lack of purpose, unrelated memorizing, syllabus-boundness), it makes sense that this correlation is negative. The weakest correlation (-0.0101) is between deep learning and exam points. It is interesting that these two do not correlate. Could this mean that the deep learning approach, i.e. seeking meaning, relating ideas and use of evidence, does not lead to learning that can be measured with exam points? Or is the exam measuring something else than deep learning?
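As a small self-contained check of what the Corr values mean, the Pearson correlation can be written out by hand and compared to R’s cor() (toy vectors, not the course data):

```r
# Pearson correlation: covariance scaled by both standard deviations,
# so the result always lies between -1 and +1
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
r <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
all.equal(r, cor(x, y))  # TRUE
```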

Linear regression and correlation


Before moving to multiple regression, we will look at three chosen explanatory variables for exam points separately: attitude, deep and surf. The significance of each will be considered separately and taken into account when moving to multiple regression.

#y ~ x
#y - Target (dependent) variable: Exam points
#x - Three explanatory variables: x1 -> attitude (highest correlation), x2 -> deep (lowest correlation), x3 -> surf (most negative correlation)



EXAM POINTS ~ ATTITUDE

#Creating a scatter plot y ~ x1 (positive correlation)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'



Here we have a linear regression between attitude (explanatory variable) and exam points (dependent variable). The fitted regression line fits well and slopes upward, as there is a positive correlation between the variables.

The positive regression coefficient can be observed numerically in the ‘Estimate’ column of the model summary below. The regression coefficient measures the change in the mean response associated with a unit change in the corresponding explanatory variable, when all other explanatory variables remain unchanged. The next column, ‘Std. Error’, gives the standard error of the coefficient estimate: the smaller the value, the more precisely the coefficient has been estimated. The residuals, summarized at the top of the output, estimate the error terms in the model, i.e. the differences between the observed values of the response and the fitted values.

The significance can be assessed using the p-value (Pr(>|t|)). The smaller (closer to zero) the p-value is, the stronger the evidence that the coefficient differs from zero. If the p-value exceeds 0.05, the relation between the explanatory and target variables is not considered statistically significant at the 5 % level, as the data remain consistent with a zero coefficient.

Let us look at the p-value in our case. The value is very small (4.12e-09 << 0.05), which indicates that there is a statistical relationship between attitude and exam points.

#Creating a regression model with the explanatory variables x1
my_model1 <- lm(points ~ attitude, data = learning2014)

#Printing out a summary of the model
print(summary(my_model1))
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
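The estimate and p-value discussed above can also be extracted programmatically instead of read off the printed summary; a sketch with simulated toy data (true slope 3), not the course data:

```r
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 + 3 * d$x + rnorm(20)        # linear relation plus noise
m <- lm(y ~ x, data = d)
coef(summary(m))["x", "Estimate"]     # slope estimate, close to 3
coef(summary(m))["x", "Pr(>|t|)"]     # its p-value, far below 0.05
```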



EXAM POINTS ~ DEEP LEARNING

#Creating a scatter plot y ~ x2 (nearly non-existing correlation)
qplot(deep, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'



Here we have a linear regression between deep learning (explanatory variable) and exam points (dependent variable). The fitted regression line is nearly horizontal, which indicates that there is a weak correlation (if any).

Now, looking at the p-value (Pr(>|t|)) shows that the value is fairly large (0.897 >> 0.05). This indicates that there is no statistical relationship between deep learning and exam points.

#Creating a regression model with the explanatory variables x2
my_model2 <- lm(points ~ deep, data = learning2014)

#Printing out a summary of the model
print(summary(my_model2))
## 
## Call:
## lm(formula = points ~ deep, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6913  -3.6935   0.2862   4.9957  10.3537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.1141     3.0908   7.478 4.31e-12 ***
## deep         -0.1080     0.8306  -0.130    0.897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared:  0.000103,   Adjusted R-squared:  -0.005994 
## F-statistic: 0.01689 on 1 and 164 DF,  p-value: 0.8967



EXAM POINTS ~ SURFACE APPROACH

#Creating a scatter plot y ~ x3 (negative correlation)
qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'



Lastly, we have a linear regression between surface approach (explanatory variable) and exam points (dependent variable). The fitted regression line has a negative slope; thus, the correlation between the variables is inverse.

Let us check the p-value (Pr(>|t|)) once more. The value is just outside the level of significance (0.0635 > 0.05), which indicates that there is no statistically significant relationship between surface approach and exam points. However, the p-value is so close to the significance level that in some cases the variable could still be considered, albeit with careful justification.

#Creating a regression model with the explanatory variables x3
my_model3 <- lm(points ~ surf, data = learning2014)

#Printing out a summary of the model
print(summary(my_model3))
## 
## Call:
## lm(formula = points ~ surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6539  -3.3744   0.3574   4.4734  10.2234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.2017     2.4432  11.134   <2e-16 ***
## surf         -1.6091     0.8613  -1.868   0.0635 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared:  0.02084,    Adjusted R-squared:  0.01487 
## F-statistic:  3.49 on 1 and 164 DF,  p-value: 0.06351



To make the next task meaningful, I will keep two of the explanatory variables: attitude and surface approach. The latter is not statistically significant, but without it there would be no multiple regression to run.

Therefore, it stays.

As deep learning has no significance whatsoever, it will be excluded from the next phase.

Multiple regression - Multiple R-squared


Below we have a summary of a multiple regression including all three variables. Already from the standard errors we can see that the value is smallest for attitude (strongest correlation) and larger for deep learning and surface approach (weaker correlations).

The multiple R-squared is the square of the correlation between the model’s fitted values and the observed exam points. Its value in this case is 0.2024, meaning that slightly over 20 % of the variation in exam points is explained by the three explanatory variables.

The multiple R-squared is known to increase whenever new explanatory variables are added, even if the added variable is not useful. Conversely, excluding a variable will almost always decrease it. Thus, it is useful to also look at the adjusted R-squared, which adjusts for the number of explanatory variables in the model. When all three variables are included, the adjusted R-squared is 0.1876 (~18.8 % of the variation explained).
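The adjusted value can be reproduced from the multiple R-squared with the usual adjustment formula (n observations, p explanatory variables):

```r
# R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- 0.2024   # multiple R-squared of the three-variable model
n  <- 166      # number of observations
p  <- 3        # number of explanatory variables
round(1 - (1 - r2) * (n - 1) / (n - p - 1), 4)  # 0.1876
```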



#Creating a regression model with multiple explanatory variables
my_model4 <- lm(points ~ attitude + deep + surf, data = learning2014)

#Printing out a summary of the model
print(summary(my_model4))
## 
## Call:
## lm(formula = points ~ attitude + deep + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9168  -3.1487   0.3667   3.8326  11.3519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.3551     4.7124   3.895 0.000143 ***
## attitude      3.4661     0.5766   6.011 1.18e-08 ***
## deep         -0.9485     0.7903  -1.200 0.231815    
## surf         -1.0911     0.8360  -1.305 0.193669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1876 
## F-statistic:  13.7 on 3 and 162 DF,  p-value: 5.217e-08



As pointed out above, deep learning has no significance when it comes to exam points. Let us examine what happens when we exclude it from the multiple regression.

#Creating a regression model with multiple explanatory variables
my_model5 <- lm(points ~ attitude + surf, data = learning2014)

#Printing out a summary of the model
print(summary(my_model5))
## 
## Call:
## lm(formula = points ~ attitude + surf, data = learning2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.277  -3.236   0.386   3.977  10.642 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.1196     3.1271   4.515 1.21e-05 ***
## attitude      3.4264     0.5764   5.944 1.63e-08 ***
## surf         -0.7790     0.7956  -0.979    0.329    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 163 degrees of freedom
## Multiple R-squared:  0.1953, Adjusted R-squared:  0.1854 
## F-statistic: 19.78 on 2 and 163 DF,  p-value: 2.041e-08



As expected, the value of the multiple R-squared decreases, but only slightly. The new values (summary above) are:

  • Multiple R-squared 0.1953 (~ 19.5 % of the variation explained) vs. 20 %
  • Adjusted R-squared: 0.1854 (~ 18.5 % of the variation explained) vs. 18.8 %

There is only a small decrease in the R-squared values after excluding deep learning. Thus, slightly less of the variance in exam points is explained by attitude + surf than with all three variables.

In addition to the R-squared values we can also look at, for example, the model p-value. With all variables included the p-value is 5.217e-08 (first summary), and after deep learning has been excluded it is 2.041e-08 (second summary), i.e. the significance has strengthened. The model is clearly significant in both cases.

The small differences might be due to the fact that surface approach on its own is not significant either. The adjusted R-squared (0.1856 vs. 0.1854) and the model p-value (4.119e-09 vs. 2.041e-08) would actually be slightly better with attitude alone.
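Nested models like these can also be compared with a formal F-test using anova(); with the real models this would be anova(my_model5, my_model4). A self-contained sketch with toy data where x2 is pure noise:

```r
set.seed(2)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 + rnorm(50)
m_small <- lm(y ~ x1, data = d)        # reduced model
m_big   <- lm(y ~ x1 + x2, data = d)   # full model
anova(m_small, m_big)  # Pr(>F) tests whether adding x2 improves the fit
```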



Diagnostic plots


Next we will draw three diagnostic plots for the multiple regression whose explanatory variables are the above-mentioned attitude and surface approach. The chosen diagnostic plots are Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage (cheatsheet below: 1, 2 and 5).

Cheatsheet for the ‘which’ argument of plot():

  which  Graphic
  1      Residuals vs Fitted values
  2      Normal QQ-plot
  3      Standardized residuals vs Fitted values
  4      Cook's distances
  5      Residuals vs Leverage
  6      Cook's distance vs Leverage
#Creating a regression model with multiple explanatory variables
my_model6 <- lm(points ~ attitude + surf, data = learning2014)

#Drawing diagnostic plots using the plot() function. Plots 1, 2 and 5 chosen
par(mfrow = c(2,2))
plot(my_model6, which = c(1,2,5))



The first diagnostic plot (Residuals vs Fitted) shows whether the residuals have non-linear patterns. Let us see how the residuals scatter around the horizontal reference line. As the dots are evenly distributed along the line with no distinctive pattern, the fitted model seems appropriate. Only a couple of observations stand out on the negative side of the red line. These extreme points have been numbered (145, 56, 35).

The second diagnostic plot (Normal Q-Q) checks whether the residuals are normally distributed. As the residuals mainly line up with the straight dashed line, we may say that they are approximately normally distributed. However, residuals are seldom on a perfectly straight line, and here too the numbered extreme points fall notably below the dashed line; similar behavior can be observed in the upper corner of the plot. Still, the great majority of the standardized residuals follow the dashed line moderately well.

The last diagnostic plot (Residuals vs Leverage) indicates whether there are influential observations. The leverage of an observation is based on how much its value of the explanatory variable differs from the mean of that variable. The larger the leverage, the more influence the observation can potentially have on the estimated model parameters. The red dashed lines mark Cook's distance, which measures the influence of an observation on the estimation of all the parameters in the model. Our case is quite typical in that there are no influential observations: all the residuals fall well inside the Cook's distance lines (which are not even visible in the figure). If there were values in the upper or lower right corner, one would need to be careful, as those observations would be influential on the regression line.
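Cook's distances themselves can be computed with cooks.distance(); a toy sketch (not the course data) with one high-leverage outlier as the 11th observation shows how such a point stands out:

```r
d <- data.frame(x = c(1:10, 30), y = c(2 * (1:10), 5))  # outlier at x = 30
m <- lm(y ~ x, data = d)
cd <- cooks.distance(m)
which.max(cd)  # the outlier dominates the parameter estimates
```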





Summary of my learning


This week’s tasks have been mainly fun and really educational. However, they have required several days of full-time work. The workload is not light for me, as I do not have much previous knowledge of statistical methods, so there is a lot of new material to learn. The textbook is helpful in this sense, but there is a lot of material to scan through. It is good to have real data to work with: it helps to understand what all the correlations, significance thresholds and diagnostic plots really mean. Especially the diagnostic plots and their interpretation were new to me.

Coding in R is not familiar to me, but my previous coding background with Python is somewhat helpful. DataCamp is extremely supportive, and it pays off to do the DataCamp exercises carefully first. It takes time but helps with producing my own code. I got the practicalities working last week, and it makes me happy that everything is still working this week, so I can fully focus on the tasks and the analysis.


Regards, Janina


Chapter 3 - Logistic Regression


Overview of the data


The purpose of the analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. Let us take a closer look at the cleaned data, which results from the data wrangling: it was collected in 2014 in Portuguese secondary schools in a survey about alcohol consumption and students’ achievements in mathematics (math) and Portuguese language (por).

Records of the same students answering in the two different courses have been joined. In addition:

  • The variables not used for joining the two data have been combined by averaging (including the grade variables)
  • ‘alc_use’ is the average of alcohol usage during weekdays (‘Dalc’) and weekends (‘Walc’)
  • ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
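As a minimal sketch (hypothetical mini data, not the joined data itself), the last two rules translate to:

```r
d <- data.frame(Dalc = c(1, 3), Walc = c(1, 4))
d$alc_use  <- rowMeans(d[, c("Dalc", "Walc")])  # average weekday/weekend use
d$high_use <- d$alc_use > 2                     # TRUE above the threshold of 2
d$alc_use   # values 1.0 and 3.5
d$high_use  # FALSE TRUE
```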

date()
## [1] "Thu Dec 09 16:07:09 2021"
#Reading the data
pormath = read.table(file="pormath.txt", sep=",", header=TRUE)

#Structure and dimensions of the data
str(pormath)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr  "R" "R" "R" "R" ...
##  $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr  "T" "T" "T" "T" ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr  "other" "other" "other" "health" ...
##  $ reason    : chr  "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr  "mother" "mother" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr  "yes" "yes" "yes" "yes" ...
##  $ activities: chr  "yes" "no" "yes" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "yes" "yes" "no" "yes" ...
##  $ romantic  : chr  "no" "yes" "no" "no" ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : chr  "yes" "no" "no" "no" ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : chr  "yes" "no" "no" "no" ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : chr  "yes" "no" "yes" "yes" ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
#Names of the columns
colnames(pormath)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"


The structure of the data looks good. There are 370 observations of 51 variables (see colnames above), such as gender, age, information about the family’s education, study time, alcohol consumption etc. The entire attribute information can be found in the original data documentation.

Chosen variables and hypothesis


library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Drawing a bar plot of each variable
gather(pormath) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()


To examine the relationships related to alcohol consumption, some variables need to be chosen. In this report we have chosen age, absences, going out with friends (goout) and final grade (G3) to be compared with alcohol consumption (alc_use) and with whether consumption is high or not (high_use).

My personal hypotheses

  • Student’s age in relation to alcohol consumption
    • The hypothesis is that there is a positive correlation between these two variables, i.e. alcohol consumption increases with age. This is mainly due to the law that sets an age limit on alcohol consumption.

  • Student’s frequency of going out with friends in relation to alcohol consumption
    • The hypothesis is that there is a positive correlation between these two variables, i.e. the more the student goes out with friends, the higher the alcohol consumption. This is based on the opportunity to use alcohol at parties and hang-outs with groups of friends. Alcohol may be more likely to be used with friends than alone.

  • Student’s absences (from lessons) in relation to alcohol consumption
    • The hypothesis is that there is a positive correlation between these two variables, i.e. the more absences the student has, the higher the alcohol consumption. This is just a hypothesis: an absence does not mean that the student is out drinking. However, the possibility is stronger when the student is known not to be in school.

  • Student’s final grade in relation to alcohol consumption
    • The hypothesis is that there is a negative correlation between these two variables, i.e. the higher the student’s grades, the less the student uses alcohol. Alcohol usage does not necessarily affect the ability to succeed in school subjects. However, alcohol is unlikely to be used while studying, so high alcohol consumption may eat into the time and capacity for studying.



Numerical and graphical interpretation


Let us begin with a numerical overview of all four chosen variables grouped by high/low alcohol consumption (high_use): age, goout, absences and final grade (G3):

#Producing summary statistics by group from the chosen variables

#Gender, high/low alcohol consumption, number of observations, mean age
pormath %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_age
##   <chr> <lgl>    <int>    <dbl>
## 1 F     FALSE      154     16.6
## 2 F     TRUE        41     16.5
## 3 M     FALSE      105     16.3
## 4 M     TRUE        70     16.9


The mean age is approximately equal for all females and for males who do not use much alcohol; only the mean age of males with high alcohol consumption is slightly higher. Females with high alcohol consumption (41, vs. 154 with low consumption) form the smallest of the four groups (females and males with high/low consumption). Male students with an average consumption score above 2 (high_use) are also in the minority (70 vs. 105).

#Gender, high/low alcohol consumption, number of observations, mean of going out
pormath %>% group_by(sex, high_use) %>% summarise(count = n(), mean_go_out = mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_go_out
##   <chr> <lgl>    <int>       <dbl>
## 1 F     FALSE      154        2.95
## 2 F     TRUE        41        3.39
## 3 M     FALSE      105        2.70
## 4 M     TRUE        70        3.93


The mean of going out with friends seems to be higher for those with high alcohol consumption, regardless of gender. In addition, females with low alcohol consumption go out more often than males with low alcohol consumption.

#Gender, high/low alcohol consumption, number of observations, mean of absences
pormath %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_absences
##   <chr> <lgl>    <int>         <dbl>
## 1 F     FALSE      154          4.25
## 2 F     TRUE        41          6.85
## 3 M     FALSE      105          2.91
## 4 M     TRUE        70          6.1


The absences show clear differences related to high/low alcohol consumption. For both females and males with high alcohol consumption, the mean number of absences is over 6, clearly higher than for the low-consumption groups (4.25 for females, 2.91 for males). Females with high alcohol consumption have more absences than males with high alcohol consumption.

#Gender, high/low alcohol consumption, number of observations, mean grade
pormath %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       11.4
## 2 F     TRUE        41       11.8
## 3 M     FALSE      105       12.3
## 4 M     TRUE        70       10.3


The final grades look very similar for all females regardless of their alcohol use; there is even a slight increase in grades among females using alcohol more frequently (11.8 vs. 11.4). For males a clear difference can be observed: males not using much alcohol have clearly higher grades (12.3) than those using alcohol more often (10.3).

Bar plots


Let us continue by drawing bar plots of all four chosen variables and alcohol consumption; A age, B goout, C absences, D final grade (G3), E alc_use and F high_use:

library(tidyr); library(dplyr); library(ggplot2); library(cowplot)

#Drawing a bar plot of each variable (alc_use, high_use and 4 chosen variables)


#Initializing a barplot of age
g1 <- ggplot(pormath, aes(x = age, fill = sex)) + geom_bar() #+ ylab("age")

#... of going out with friends (goout)
g2 <- ggplot(pormath, aes(x = goout, fill = sex)) + geom_bar()

#... of absences
g3 <- ggplot(pormath, aes(x = absences, fill = sex)) + geom_bar()

#... of final grade (G3)
g4 <- ggplot(pormath, aes(x = G3, fill = sex)) + geom_bar()

#... of alcohol consumption (alc_use)
g5 <- ggplot(pormath, aes(x = alc_use, fill = sex)) + geom_bar()

#... of whether the consumption is high/low (high_use)
g6 <- ggplot(pormath, aes(x = high_use, fill = sex)) + geom_bar()


#Plotting all 6 bar plots
plot_grid(g1, g2, g3, g4, g5, g6, labels="AUTO")


All the bar plots look reasonable. There seem to be more female respondents, whose ages mainly vary between 15 and 18 years. Males seem to go out with friends slightly more often than females (goout), and there are more males with high alcohol consumption.

Let us now examine the relationships and correlations between the 4 chosen variables and alcohol consumption:

library(GGally)
library(ggplot2)
library(dplyr)

#Choosing a handful of columns to keep
keep_columns <- c("sex", "age", "goout", "absences", "G3", "alc_use", "high_use")

#Selecting the 'keep_columns' to create a new dataset
data <- select(pormath, one_of(keep_columns))

#Creating a plot matrix with ggpairs(), colouring has been set by gender (M=blue, F=red)
p <- ggpairs(data, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

#Drawing the plot
p


There are a few more female respondents than male. The distributions shown along the diagonal of the plot matrix are fairly similar between females and males, excluding age and alcohol consumption. The age distribution has higher peaks for females, and high alcohol usage is slightly more common among males.

Alcohol consumption correlates with all chosen variables: age (0.16), goout (0.389), absences (0.214) and G3, the final grade (-0.156). Only the correlation between final grade and alcohol consumption is negative. These relations follow the hypotheses presented above. However, there are notable differences between the two genders. For females, age and final grades do not correlate with alcohol consumption; for male respondents all chosen variables do.

According to the correlations the hypotheses are mainly supported. I did not distinguish between female and male students, which brought a couple of surprises: the lack of correlation of age and final grades with alcohol usage for females was not expected. When the genders are combined, the observations follow the above-mentioned hypotheses.

Box plots


Let us also look at box plots. We will compare the chosen variables against whether the student has a high alcohol consumption or not; A age, B goout, C absences, D final grade (G3).

library(cowplot)

#Let us do boxplots out of 4 variable sets

#Initializing a boxplot of high_use and age
g11 <- ggplot(pormath, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + ylab("age")

#... of high_use and going out with friends (goout)
g22 <- ggplot(pormath, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("going out")

#... of high_use and absences
g33 <- ggplot(pormath, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + ylab("absences")

#... of high_use and final grade (G3)
g44 <- ggplot(pormath, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("grade")

#Plotting all 4 boxplots
plot_grid(g11, g22, g33, g44, labels = "AUTO")


In box plots the thicker line in the middle of the box shows the median of the answers. The edges of the box show the lower and upper quartiles of the observations. The vertical lines outside the box, the ‘whiskers’, extend to the most extreme observations within 1.5 times the interquartile range from the box, and the dots represent outliers beyond that.
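The same five numbers can be computed without drawing anything; a small sketch using base R's boxplot.stats() (ggplot2's hinge definition can differ slightly for small samples):

```r
#Illustrative data: 40 lies far from the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 40)

b <- boxplot.stats(x)
b$stats  #lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    #observations beyond 1.5 * IQR from the box, drawn as dots
```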

High consumption of alcohol is slightly more common among older males (~17 years). Otherwise age does not have a clear influence according to the box plot. Instead, the correlation between going out with friends and high alcohol usage is clearly visible: the median of goout for both females and males with high consumption is approximately 4, whereas students with low alcohol consumption have a median of goout = 3.

Absences are also more common when there is high alcohol usage (positive correlation). The differences may seem small in the box plot, but only because the maximum values of absences are far from the median (exceeding even 40 absences). The numerical tables and correlations examined above give a clearer picture for the students who are not missing classes more than 20 times. When looking at the final grades, the negative correlation is visible for males, i.e. male grades are higher when alcohol consumption is low. For females there are no clear differences perceptible.

Logistic regression model


Let us fit a logistic regression model to statistically explore the relationship between the four variables chosen above and the binary high/low usage of alcohol (target variable).


#Finding the model with glm()
m <- glm(high_use ~ age + goout + absences + G3, data = pormath, family = "binomial")

#Printing out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + goout + absences + G3, family = "binomial", 
##     data = pormath)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8967  -0.7525  -0.5409   0.9467   2.2946  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.77834    1.96186  -1.926  0.05412 .  
## age          0.04563    0.11022   0.414  0.67890    
## goout        0.70668    0.11917   5.930 3.03e-09 ***
## absences     0.07315    0.02237   3.270  0.00108 ** 
## G3          -0.04472    0.03923  -1.140  0.25435    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 388.67  on 365  degrees of freedom
## AIC: 398.67
## 
## Number of Fisher Scoring iterations: 4
#Printing out the coefficients of the model
coef(m)
## (Intercept)         age       goout    absences          G3 
## -3.77834234  0.04562870  0.70667982  0.07314823 -0.04471930


We will start from the p-values (Pr(>|z|)) of the variables. The z and p-values in the regression table tell whether each coefficient differs significantly from zero. Variables goout and absences stand out as clearly significant in terms of explaining the target variable, whereas age and final grade (G3) are not significant at the 5 % level. Note, however, that for a categorical predictor the p-value of a single class only compares that class with the reference class, so one non-significant class would not by itself imply that the whole variable is meaningless and should be removed from the model.
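The p-values can also be extracted programmatically from the summary object; a sketch, assuming the model object m fitted above:

```r
#The coefficient table of a glm summary contains the Pr(>|z|) column
coefs <- summary(m)$coefficients
pvals <- coefs[, "Pr(>|z|)"]

#Predictors significant at the 5 % level (here goout and absences)
names(pvals)[pvals < 0.05 & names(pvals) != "(Intercept)"]
```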

Odds ratios and confidence interval


Next we will look closer at the coefficients of the model as odds ratios. We will also provide confidence intervals for them.

#Computing odds ratios (OR)
OR <- coef(m) %>% exp

#Computing confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
#Printing out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR        2.5 %   97.5 %
## (Intercept) 0.02286056 0.0004637374 1.033922
## age         1.04668570 0.8430071337 1.299954
## goout       2.02724925 1.6139414455 2.577757
## absences    1.07589001 1.0312932049 1.127070
## G3          0.95626587 0.8852281993 1.032878


The odds ratio quantifies the relationship between a chosen variable and the target variable. An odds ratio higher than 1 means that the variable is associated with the target variable being TRUE (the student has high alcohol consumption). The odds ratio is highest for goout, and it also exceeds one for age and absences. Final grade is the only variable whose odds ratio remains under 1 (~0.96).
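As a worked example of where these numbers come from, using the goout estimate from the summary above:

```r
#An odds ratio is the exponentiated regression coefficient
b_goout <- 0.70668
exp(b_goout)  #~2.03: one extra point of goout roughly doubles the odds of high_use

#Odds relate to probability as p / (1 - p)
p <- 0.5
p / (1 - p)   #odds of 1 correspond to a 50 % probability
```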

Let us look closer at the confidence intervals. If the confidence interval contains the value one, we can conclude that there is no evidence of an association between the chosen variable and the target variable. In our case this applies to age (0.8 - 1.3) and final grade (0.9 - 1.0). Thus, there is no evidence of an association between high_use and age or final grade. Instead, the confidence intervals of both goout and absences lie entirely above one, so they do have an association with high alcohol consumption.

These results lead to a new conclusion. Even though a small correlation of age and final grades with high_use was visible earlier, according to the odds ratios and confidence intervals they should no longer be included in the model; their association is too weak. As predicted in the hypothesis, going out with friends and absences from lessons do associate with high alcohol consumption. Thus, let us keep these two variables (goout, absences) and move on to explore the predictive power of our model.

Predictive power of the model


Using the variables which, according to the logistic regression model above, had a statistical relationship with high/low alcohol consumption (goout, absences), we will now explore the predictive power of our model.

#Fitting the model
m <- glm(high_use ~ goout + absences, data = pormath, family = "binomial")

#Predicting - predict() - the probability of high_use
probabilities <- predict(m, type = "response")

#Adding the predicted probabilities to 'alc'
alc <- mutate(pormath, probability = probabilities)

#Using the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

#Let us see the last ten original classes, predicted probabilities, and class predictions
select(alc, goout, absences, sex, high_use, probability, prediction) %>% tail(10)
##     goout absences sex high_use probability prediction
## 361     3        7   M     TRUE  0.28963176      FALSE
## 362     3        3   M     TRUE  0.23104046      FALSE
## 363     1        2   M     TRUE  0.06029226      FALSE
## 364     4        4   M     TRUE  0.40315734      FALSE
## 365     2        3   M    FALSE  0.12606082      FALSE
## 366     3        4   M     TRUE  0.24487648      FALSE
## 367     2        0   M    FALSE  0.10291931      FALSE
## 368     4        4   M     TRUE  0.40315734      FALSE
## 369     4        8   M     TRUE  0.47825016      FALSE
## 370     2        0   M    FALSE  0.10291931      FALSE


Above we now have a table for comparing the real values with the values originating from the probabilities and predictions. If the probability (of using alcohol more than 2 times a week) exceeds 0.5, the prediction is TRUE. The table shows the last 10 observations of our data, and the predictions do not look too good: 7 out of the 10 observations are in reality categorized as high_use = TRUE, but all the predictions are FALSE.

Let us tabulate the target variable (high_use) versus the predictions to see how well the predictions actually work.

#Let us tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   236   23
##    TRUE     65   46


The cases where high_use and prediction are both FALSE or both TRUE are the successfully predicted observations. The contradicting cells, TRUE predicted as FALSE or FALSE predicted as TRUE (the lower left and upper right values), are the wrong predictions. As most of the students do not use a lot of alcohol, it is reasonable that the FALSE/FALSE count is clearly the highest. To see the share of the other cases better, let us use percentages.
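From the four cells of that table the usual summary rates can be computed directly; a sketch using the counts printed above:

```r
#Counts from the cross tabulation above
tn <- 236  #high_use FALSE, predicted FALSE
fp <- 23   #high_use FALSE, predicted TRUE
fn <- 65   #high_use TRUE,  predicted FALSE
tp <- 46   #high_use TRUE,  predicted TRUE

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  #share of correct predictions
sensitivity <- tp / (tp + fn)                   #share of high users caught
specificity <- tn / (tn + fp)                   #share of low users correctly kept
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 2)
```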

#Tabulating the target variable versus the predictions as percentages
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63783784 0.06216216 0.70000000
##    TRUE  0.17567568 0.12432432 0.30000000
##    Sum   0.81351351 0.18648649 1.00000000


Above we see that the model classified ~64 % of the students correctly as not using a lot of alcohol, and ~12 % were correctly predicted to have a high alcohol consumption.

The share of inaccurately classified individuals (mispredicted observations), i.e. the training error, has been calculated using the loss function defined below. In practice the function computes the share of wrong predictions in the data. The training error is ~0.24 (~0.06 + ~0.18 = ~0.24).

#Defining a loss function (mean prediction error)
#Class is here high_use
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

#Calling loss_func to compute the average number of wrong predictions in the data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378



10-fold cross-validation


Let us perform a 10-fold cross-validation on our model.

#Reusing the loss function (average prediction error) defined above

#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

#Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2432432


The training error of our model is 0.2378378 ≈ 0.24. This is slightly less than the error the 10-fold cross-validation suggests, which is expected: cross-validation estimates the test-set error, and that is usually somewhat higher than the training error. The prediction error from the 10-fold cross-validation varies a little every time it is run (roughly between 24 % and 26 %), because the observations are assigned to the folds randomly.
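The run-to-run variation comes from the random fold assignment; fixing the seed before the call makes the result reproducible. A sketch, assuming alc, m and loss_func defined above:

```r
library(boot)

#Fixing the random number generator so the folds are the same on every run
set.seed(123)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]  #identical between runs with the same seed
```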



Summary of my learning


This week’s tasks were more challenging than last week’s. Even now everything is not clear, but I think I managed to complete all the necessary tasks plus one bonus task. New plots were used and learned. The prediction model, the odds ratios and the 10-fold cross-validation were new and interesting ways to check whether the variables really associate with the target and whether it can be predicted.


Regards, Janina


Chapter 4 - Clustering and classification


Overview of the data


In this analysis we will be using a data set called “Boston”, which can be found in the library “MASS” (source). Let us take a closer look at the data.

date()
## [1] "Thu Dec 09 16:07:23 2021"
#Accessing the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
#Loading the data
data("Boston")

#Exploring the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00


The Boston data frame has 506 rows and 14 columns. The 14 columns are:

  • crim > per capita crime rate by town.
  • zn > proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus > proportion of non-retail business acres per town.
  • chas > Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox > nitrogen oxides concentration (parts per 10 million).
  • rm > average number of rooms per dwelling.
  • age > proportion of owner-occupied units built prior to 1940.
  • dis > weighted mean of distances to five Boston employment centres.
  • rad > index of accessibility to radial highways.
  • tax > full-value property-tax rate per $10,000.
  • ptratio > pupil-teacher ratio by town.
  • black > 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat > lower status of the population (percent).
  • medv > median value of owner-occupied homes in $1000s

Let us next see the plot matrix of all the variables.



#Plotting matrix of the variables
pairs(Boston)


And then the correlation matrix. The matrix is an easy way to examine the correlations between all variables.

library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
#Calculating the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

#Printing the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
#Visualizing the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)


The correlations of 0.6 or above are between: crim - rad (0.63), zn - dis (0.66), indus - nox (0.76), indus - age (0.64), indus - rad (0.60), indus - tax (0.72), indus - lstat (0.60), nox - age (0.73), nox - rad (0.61), nox - tax (0.67), rad - tax (0.91), rm - medv (0.70) and age - lstat (0.60).

  • The highest correlation is between rad - tax (0.91), ergo between the index of accessibility to radial highways and a full-value property-tax rate per $10,000.

Correlations of -0.6 or below are between: indus - dis (-0.71), nox - dis (-0.77), age - dis (-0.75), rm - lstat (-0.61) and lstat - medv (-0.74).

  • The lowest correlation is between nox - dis (-0.77), ergo between the nitrogen oxides concentration (parts per 10 million) and weighted mean of distances to five Boston employment centres.

Scaling and standardizing


For later use, we need to scale our data. In scaling, the column mean is subtracted from each column and the difference is then divided by the column's standard deviation.
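The same operation can be written out by hand to check what scale() does; a small sketch on one column of Boston:

```r
library(MASS)
data("Boston")

#Manual standardization of one column: (x - mean) / sd
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

#scale() does the same for every column at once
auto <- scale(Boston)[, "crim"]

all.equal(as.numeric(auto), manual)  #TRUE
```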

As the “Boston” data set contains only numerical values, the whole data can be standardized in one go using the function scale().

#Let us center and standardize variables
boston_scaled <- scale(Boston)

#Summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
#The class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"


Standardizing the data sets the mean of every variable to 0 and the standard deviation to 1, which makes the variables comparable with each other.

The class of the scaled Boston data set is a matrix (array).

Later on, we will want the data to be a data frame. Thereby, we will next convert the scaled Boston-data set to a data frame format.

#Changing the object to data frame
boston_scaled <- as.data.frame(boston_scaled)


We will focus on the crime rate in the Boston data set. To analyse the variable crim (crime rate per capita by town) we need to create a categorical variable from the continuous one. We will cut the variable at its quantiles (used as break points) so that low, middle and high crime rates fall into their own categories.

#Creating a quantile vector of 'crim' and printing it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#Creating a categorical variable 'crime' according to the quantiles set in bins
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))

#Let us look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#Dropping original variable 'crim' from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

#Replacing removed 'crim' with the new categorical value 'crime' to scaled data
boston_scaled <- data.frame(boston_scaled, crime)


The categorized crime variable consists of categories that are as even as possible:

  • 0%-25% low > (127)
  • 25%-50% med_low > (126)
  • 50%-75% med_high > (126)
  • 75%-100% high > (127)


Next we will divide our data set into ‘train’ and ‘test’ sets; 80 % of the data goes to the ‘train’ set.

#The number of rows in the Boston data set 
n <- nrow(boston_scaled)

#Choosing randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

#Creating 'train' set (80% of the rows)
train <- boston_scaled[ind,]

#Creating 'test' set (20% of the rows)
test <- boston_scaled[-ind,]

#Let us save the correct classes from 'test' data
correct_classes <- test$crime

#And remove the crime variable from 'test' data
test <- dplyr::select(test, -crime)


Linear discriminant analysis


Now we will apply linear discriminant analysis to the train set. LDA is a classification (and dimension reduction) method used to find a linear combination of features that characterizes two or more classes of objects. The resulting combinations may be used as a linear classifier or for dimensionality reduction before later classification.

In the LDA below the categorical crime rate is used as the target variable and all the other variables in the data set act as predictor variables. An LDA biplot is drawn below the fit.

#A linear discriminant analysis (LDA)
#Crime rate as target variable, all the others (.) as predictor variables
lda.fit <- lda(crime ~ ., data = train)

#Printing the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2326733 0.2772277 0.2500000 0.2400990 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       0.9744596 -0.9098180 -0.104792887 -0.8876322  0.4853945 -0.8680449
## med_low  -0.1269957 -0.2896550  0.008892378 -0.5600847 -0.1810965 -0.2921296
## med_high -0.3751653  0.2003698  0.195445218  0.4402730  0.1093779  0.4326837
## high     -0.4872402  1.0149946 -0.069385757  1.0423878 -0.4531883  0.8039125
##                 dis        rad        tax    ptratio       black       lstat
## low       0.8786761 -0.7020851 -0.7307117 -0.4875567  0.38132823 -0.77452569
## med_low   0.3159263 -0.5450436 -0.4769806 -0.0318375  0.31876387 -0.07921127
## med_high -0.3916494 -0.4087748 -0.2886782 -0.3608757  0.04694643 -0.02024186
## high     -0.8390339  1.6596029  1.5294129  0.8057784 -0.88156866  0.86153824
##                 medv
## low       0.57235431
## med_low  -0.03773927
## med_high  0.19935953
## high     -0.69192804
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.12173157  0.660094102 -0.91420474
## indus    0.04302670 -0.227110531  0.22980204
## chas    -0.06548305 -0.054965234  0.15016027
## nox      0.21476876 -0.928161508 -1.28178102
## rm      -0.12378505 -0.051310938 -0.19196261
## age      0.27334440 -0.305334088 -0.09071289
## dis     -0.10659761 -0.310389387  0.07458488
## rad      3.48633098  0.899749430 -0.02330908
## tax     -0.01886255  0.007582303  0.35954274
## ptratio  0.12821085  0.025139492 -0.22688665
## black   -0.15444489  0.013821395  0.12573932
## lstat    0.13299143 -0.119509268  0.40345910
## medv     0.15883296 -0.360933976 -0.23346860
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9501 0.0357 0.0142
#The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

#Target classes as numeric
classes <- as.numeric(train$crime)

#Plotting the results of lda
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)


In the graph above we can see that only LD1 separates the classes nicely; LD2 does not add much to the separation. The arrow for ‘rad’ is the longest and points towards the ‘high’ class. The arrows of ‘zn’ and ‘nox’ also differ from the others: ‘zn’ points upward and ‘nox’ downward.

Predicting classes with LDA


Let us try to predict the classes based on our model. The data was split earlier so that we have a test set and the correct class labels. Now we will test how the LDA model performs when predicting the classes of the test data.

Below, there is a cross tabulation of the results with the crime categories from the test set.

#Prediction of the classes with test data
lda.pred <- predict(lda.fit, newdata = test)

#Cross tabulation of the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19      14        0    0
##   med_low    4       7        3    0
##   med_high   1       9       14    1
##   high       0       0        1   29


The cross tabulation results in a table where the correct classes are the rows and the predictions the columns. It is good to see that most of the observations have been predicted correctly (the numbers on the diagonal). However, there are several cells with wrong predictions; the highest number of mistakes is under ‘med_low’, as the model has predicted several low and med_high cases as med_low. The exact numbers change on each run due to the randomness in the train/test split.
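The overall share of correct predictions can be read off the diagonal of that table; a sketch, assuming correct_classes and lda.pred from above:

```r
#Share of correctly predicted classes, directly...
mean(correct_classes == lda.pred$class)

#...and equivalently from the cross tabulation: diagonal over total
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)
```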

Clustering with K-means


Distances between observations


Let us start by calculating the Euclidean and Manhattan distances between the observations.

#Reloading MASS and Boston data sets
library(MASS)
data('Boston')

#Scaling and standardizing the data set 'Boston'
boston_scaled1 <- scale(Boston)

#Changing the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled1)


#Euclidean distance matrix
dist_eu <- dist(boston_scaled2)

#Let us see the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#Manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")

#Let us now see the summary of the distances again
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618


Now we continue to the K-means algorithm. K-means is one of the most used clustering methods. It assigns observations to groups (clusters) based on their similarity; the kmeans() function computes the required distances internally. As there are quite a lot of variables, only five columns (columns 6 to 10) are plotted against each other to keep the examination clear; the first five columns are less informative.

#K-means clustering
km <-kmeans(boston_scaled2, centers = 3)

#Plotting the Boston dataset with clusters
#5 columns (columns 6 to 10) have been paired up to make examination more clear
pairs(boston_scaled2[6:10], col = km$cluster)


As K-means takes the number of clusters as an argument, we need to look for the optimal number of clusters. One way of estimating it is tested below: we look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes.

K-means may produce different results every time, as it randomly assigns the initial cluster centers. The function set.seed() is used to deal with this.

library(ggplot2)

#To ensure that the result does not vary every time
set.seed(123)

#Determining the number of clusters
k_max <- 10

#Calculating the total within sum of squares (twcss)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

#Visualizing the results
qplot(x = 1:k_max, y = twcss, geom = 'line')


The optimal number of clusters is where the total WCSS drops radically. In the figure above, the most radical drop is right at the beginning and ends at the value 2. This indicates that the optimal number of clusters is 2. Let us run the K-means algorithm again using this number.

#K-means clustering
km <-kmeans(boston_scaled2, centers = 2)

#Plotting the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)


The variables produce differently shaped clusterings. The variables ‘rad’ and ‘tax’ produce two clearly separated groups (one roughly between -1 and 0 and the other around 1.5).

‘age’ and ‘dis’ produce more connected clusters, where the first cluster gradually changes into the next one. In ‘dis’ the first cluster starts from -1 and, as the value increases, the second cluster begins. In ‘age’ the first cluster starts from 1 and, as the value decreases, the second cluster begins.
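A quick way to characterize the resulting clusters numerically is to inspect the cluster centers. The following sketch uses the built-in mtcars data as a stand-in for the scaled Boston data (the same calls work on boston_scaled2 and the km object above):

```r
# Toy illustration with a built-in data set standing in for the scaled data
scaled <- scale(mtcars)
set.seed(123)
km2 <- kmeans(scaled, centers = 2)

# Cluster centers: one row per cluster, one column per (scaled) variable;
# large between-center differences mark the variables that separate the clusters
round(sort(abs(km2$centers[1, ] - km2$centers[2, ]), decreasing = TRUE), 2)

# Cluster sizes show how the observations are split
km2$size
```

On the Boston data the same comparison would show which variables (such as ‘rad’ and ‘tax’ in the plots above) separate the clusters most strongly.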



Summary of my learning


This week I felt slightly lost. I managed very well to do everything that I was asked to do, but the deeper understanding is not fully there. I can make graphs and tables and I can tell what I see… But, for example, what do the arrows (of variables) really tell me in the LDA graph? Can the distances on their own be interpreted or used somehow? How do we utilize the two resulting groups from the clustering? Maybe I’ll learn the answers later on. For now, I think I just need to settle for executing the tasks.


Regards, Janina


Chapter 5 - Dimensionality reduction techniques


Overview of the data


In this analysis we will be using data set called “human”, which is a combination of the following two data sets:

Meta file for these data sets can be found from here

Let us take a closer look at the data.

date()
## [1] "Thu Dec 09 16:07:30 2021"
human = read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)

#Using my own table did not work for some reason
#human = read.table(file="human5.txt", sep=",", header=TRUE)

#Let us see the dimensions and structure of the data set
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...


There are 155 observations of 8 variables:

  • Edu2.FM “PSE_ratio” > Ratio of Female and Male populations with secondary education in each country
  • Labo.FM “LFPR_ratio” > Ratio of labour force participation of females and males in each country
  • Edu.Exp “EYE” > Expected Years of Education
  • Life.Exp “LEB” > Life Expectancy at Birth
  • GNI > Gross National Income (GNI) per Capita
  • Mat.Mor “MMR” > Maternal Mortality Ratio
  • Ado.Birth “ABR” > Adolescent Birth Rate
  • Parli.F “PRP” > Percent Representation in Parliament

Let us next show a graphical overview of the data:

library(GGally)
library(dplyr)
library(corrplot)

#Visualizing the 'human' variables
ggpairs(human)


Above we have used a generalized pairs plot from the GGally package to visualize the data frame.

From the distributions of the variables we can say that the expected years of life and education are both relatively high. The ratio of labour force participation of females and males in each country also shows a relatively high peak. In turn, GNI, maternal mortality and adolescent birth rate are concentrated at relatively low values. The percentage of females in parliament is also quite low.

#Computing the correlation matrix and visualizing it with corrplot
cor(human) %>% corrplot


To study the linear connections, the cor() function has been used. The output has been visualized with the corrplot function. From the corrplot we can see that:

  • The highest correlation is between
    • Life Expectancy at Birth and Expected Years of Education (0.789)
    • Adolescent Birth Rate and Maternal Mortality Ratio (0.759)

  • The lowest (ergo negative) correlation is between
    • Maternal Mortality Ratio and Life Expectancy at Birth (-0.857)
    • Maternal Mortality Ratio and Expected Years of Education (-0.736)
    • Adolescent Birth Rate and Life Expectancy at Birth (-0.729)

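Picking the extreme correlations by eye works here, but the same can also be done programmatically. Below is a small sketch on the built-in mtcars data (the same idea applies directly to cor(human)):

```r
# List the strongest pairwise correlations of a data set
cm <- cor(mtcars)
cm[upper.tri(cm, diag = TRUE)] <- NA             # keep each pair only once
pairs_df <- as.data.frame(as.table(cm))          # long format: Var1, Var2, Freq
pairs_df <- pairs_df[!is.na(pairs_df$Freq), ]
head(pairs_df[order(-abs(pairs_df$Freq)), ], 5)  # 5 strongest correlations
```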

Next we will move to the PCA


Principal Component Analysis (PCA)


Principal component analysis (PCA) is used in exploratory data analysis and for making predictive models. It is a statistical procedure that decomposes a data matrix into a product of smaller matrices, revealing the most important components.

In PCA, the data is first transformed to a new space with an equal or smaller number of dimensions (the new features). These new features are called the principal components, and they always have the following properties:

  • 1st principal component captures the maximum amount of variance from the features in the original data.
  • 2nd principal component is orthogonal to the first and it captures the maximum amount of variability left.

  • All principal components are uncorrelated and each is less important than the previous one, in terms of captured variance.

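These properties are easy to verify numerically; a small check on the built-in iris data:

```r
# PCA on the standardized iris measurements
pca <- prcomp(scale(iris[, 1:4]))

# The principal component scores are uncorrelated:
# off-diagonal entries are (numerically) zero
round(cor(pca$x), 3)

# Each PC captures no more variance than the previous one: all TRUE
diff(pca$sdev) <= 0
```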

Let us perform the PCA with Singular Value Decomposition (SVD) for our (not standardized) data.



#Performing PCA with the SVD method
pca_human <- prcomp(human)

#Creating a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
#Rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

#The percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
#Creating object pc_lab (by the first two principal components) to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

#Drawing a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped



And now we repeat the Principal Component Analysis and its visualization for standardized ‘human’ data.

#Standardizing the variables
human_s <- scale(human)

#Performing PCA with the SVD method
pca_human_s <- prcomp(human_s)

#Creating a summary of pca_human_s
s_s <- summary(pca_human_s)
#Rounded percentages of variance captured by each PC
pca_pr_s <- round(100*s_s$importance[2, ], digits = 1)

#The percentages of variance
pca_pr_s
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
#Creating object pc_lab (by the first two principal components) to be used as axis labels
pc_lab_s <- paste0(names(pca_pr_s), " (", pca_pr_s, "%)")

#Drawing a biplot
biplot(pca_human_s, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_s[1], ylab = pc_lab_s[2])


Comparison between standardized and non-standardized PCA


The rounded percentages of variance captured by each PC (pca_pr and pca_pr_s above) reveal a striking difference between the two cases (standardized and not).

When the data has not been standardized, practically all of the variance (99.99%) is captured by the first principal component (PC1). This makes the scaling of the biplot odd and the graph hard to read and interpret. Gross National Income (GNI) per Capita, whose numeric scale is orders of magnitude larger than that of the other variables, overpowers all the other features and alone defines the direction of PC1 (which makes sense, as essentially all the variation has been captured by PC1).

When the data has been standardized the variance is captured by all 8 PC’s:

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
53.6% 16.2% 9.6% 7.6% 5.5% 3.6% 2.6% 1.3%

The points presented at the beginning of the PCA chapter apply: most of the variance is captured by the first PC (PC1, 53.6%) and every PC is less important than the previous one.
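The effect of a single large-scale variable (the role GNI plays here) can be reproduced with a small simulated example: without standardization one inflated column swallows essentially all the variance.

```r
set.seed(1)
# Three independent variables, one with a much larger scale
toy <- data.frame(a = rnorm(100), b = rnorm(100), big = 1000 * rnorm(100))

# Proportion of variance per PC without scaling: PC1 is essentially just 'big'
round(summary(prcomp(toy))$importance[2, ], 3)

# After scaling, the variance spreads across the components
round(summary(prcomp(scale(toy)))$importance[2, ], 3)
```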

From the standardized graph it is easy to interpret the relation between variables:


notes

1. A scatter plot is drawn where the observations are placed at x and y coordinates defined by the first two principal components, PC1 (x) and PC2 (y)

2. Arrows are drawn to visualize the connections between the original features and the PC’s

  • Angle between arrows > correlation between the features (small angle = high positive correlation)

  • Angle between a feature and a PC axis > correlation between the two. (small angle = high positive correlation)

  • Length of the arrow is proportional to the standard deviation of the feature


In the standardized graph:

  • There is positive correlation between following features:

    • 1st batch (pointing left): Expected Years of Education, Gross National Income (GNI) per Capita, Ratio of Female and Male populations with secondary education in each country and Life Expectancy at Birth

    • 2nd batch (pointing up): Percent Representation in Parliament and Ratio of labour force participation of females and males in each country

    • 3rd batch (pointing right): Maternal Mortality Ratio and Adolescent Birth Rate

  • There is no correlation between arrows that are orthogonal:

    • The 2nd batch (defined above) does not correlate with features from any of the other batches.

  • There is negative correlation between arrows that are pointing to opposite directions:

    • The 1st and 3rd batches of variables correlate negatively with each other.


About the contribution of the principal components:

  • 1st and 3rd batches of arrows are aligned with the x-axis (PC1). So these features are contributing to that dimension.

  • 2nd batch of arrows is aligned with the y-axis (PC2). So these features are contributing to that dimension.



Multiple Correspondence Analysis


For Multiple Correspondence Analysis, let us first load the ‘tea’ data set from the package ‘FactoMineR’. The structure of the data is explored and visualized below.

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
library(ggplot2)
library(tidyr)

#Loading the data
data("tea")

#Exploring the data set
str(tea) #300 obs. of  36 variables:
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#36 variables is a lot, let us choose couple to examine closer 


#Column names to keep in the data set
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

#Selecting the 'keep_columns' to create a new data set
tea_time <- dplyr::select(tea, one_of(keep_columns))

#The summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
#Visualization of the data set
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped


There are 300 observations of 36 variables. As 36 variables is quite a lot, only six of them have been taken into closer examination above: how, How, lunch, sugar, Tea and where. It seems to be most common to drink tea alone, using ready-made tea bags (How and how). Some people drink tea during lunch time, but most of the respondents drink their tea at some other point of the day (lunch). Drinking tea with sugar is nearly as common as drinking it without (sugar). Earl Grey is clearly the most popular tea type, compared to black and green tea (Tea). The tea is most likely bought from a chain store, but some people do buy their tea from tea shops (where).

Thus, to summarize all the results: the most typical respondent drinks Earl Grey, bought from a chain store as ready-made tea bags, without sugar, alone, and at some other time of the day than lunch.


Let us do the Multiple Correspondence Analysis (MCA) next. MCA is a dimensionality reduction method, used to analyse the pattern of relationships of several categorical variables. MCA is a generalization of PCA. The general aim is to condense and present the information of the cross-tabulations of categorical variables in a clear graphical form.

#Multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

#Summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |


In the summary of MCA we have first the Eigenvalues. There we can see the variances and the percentages of variance retained by each of the 11 dimensions.

Then we have Individuals: the coordinates of the first 10 individuals, their contributions (%) to each dimension, and their squared cosines (cos2) on the dimensions.

The summary also contains Categories: for the first 10, the coordinates of the variable categories, the contributions (%), the squared cosines (cos2) and the v.test values. Note that the v.test follows a normal distribution, so if the value lies outside ±1.96, the coordinate differs significantly from zero. This applies to most of the categories, excluding alone and other in the first table and green, milk and tea bag+unpackaged in the second table.
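The ±1.96 threshold is simply the two-sided 5% critical value of the standard normal distribution:

```r
# |v.test| beyond this value means the category coordinate differs
# significantly from zero at the 5% level
qnorm(0.975)
## [1] 1.959964
```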

Lastly we have Categorical variables: the squared correlation (eta2) between each variable and the dimensions. Values close to 1 indicate a strong link between the variable and the dimension. See how and Dim.1 (0.708); where and Dim.1 (0.702).

#Visualization of MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")


Here we have the MCA factor map as a biplot. The variable categories have been drawn on the first two dimensions (Dim 1 explaining ~15% and Dim 2 ~14% of the variance). The distance between variable categories gives a measure of their similarity (the closer, the more similar).

  • The following are similar together:
    • milk, Earl Grey and sugar
    • tea bag and chain store
    • Not lunch and alone
    • unpackaged and tea shop


  • The following differ from all the others:
    • other
    • lemon
    • green
    • unpackaged and tea shop > similar together but differ from all the others




Summary of my learning


This week I got answers to some of the questions I had last week. The explanations of how to interpret biplots were especially important and interesting. I think everything went well this week and therefore I am feeling quite excited!


Regards, Janina


Chapter 6 - Analysis of longitudinal data


Working with 2 sets of data


This week we have two datasets to investigate (BPRS and RATS). These datasets are longitudinal, so the response variable (and possibly some or all of the explanatory variables) is observed several times, making the data multivariate. The analysis of repeated measures and longitudinal data requires models that can both assess the effects of explanatory variables on the repeated measures of the response variable and account for the likely correlations between these measures.

The first part of this week’s data analysis focuses on the RATS dataset. We will look at some graphical displays of the data and a possible approach to its analysis. Let us take a closer look at the data.

RATS - dataset


The data is from a nutrition study conducted on three groups of rats by Crowder and Hand (1990). The three groups of rats had different types of diets. Over a 9-week period each animal’s body weight (in grams) was recorded repeatedly, approximately once a week, giving repeated-measures data. The interest was in whether there were any differences in the growth profiles of the three groups.

date()
## [1] "Thu Dec 09 16:07:39 2021"
RATSL = read.table(file="RATSL.txt", sep=",", header=TRUE)

#Factor ID & Group of RATS
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

#Let us see the dimensions and structure of the dataset
dim(RATSL)
## [1] 176   5
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ times : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...


There are 176 observations of 5 variables. First there is the ID number of the rat and the group the animal belongs to. The variable ‘Time’ has been extracted from ‘times’ (e.g. "WD1") and gives the measurement occasion as a number. The variable ‘Weight’ is the weight of the rat in grams.

Graphical display

Individual response profiles


We shall begin by plotting the weight of the rats over time in the three groups (1, 2 and 3).

library(GGally)
library(dplyr)
library(corrplot)
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
#Drawing a plot using ggplot
ggplot(RATSL, aes(x = Time, y = Weight, col = Group)) +
  geom_point() +
  labs(x = "Time (d)", y = "Weight (g)", col = "Group")


Above, all the weight profiles of the three groups are presented in one graph. The colouring shows which group the rat is in (1 = orange, 2 = green, 3 = blue). Here it is clearly visible that the rats in group 1 are notably lighter than in the two other groups.

Let us next plot the weight profiles following MABS4IODS, chapter 8:

#Drawing a new plot using ggplot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID, color = ID)) +
  geom_point(size = 0.5) + 
  geom_line(aes(group = ID)) +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))


In the graph above we can first confirm that the rats in the first group really are lighter than those in the other groups. In the second group most of the rats weigh between 400 g and 500 g, except one rat which is approximately 100 g heavier. Overall the rats in the third group are the heaviest, if the one exception from group 2 is excluded.

In groups 2 and 3 all the rats are gaining weight. The change in weight is nearly non-existent for the rats in group 1. In every group there is one rat whose weight differs from the others, i.e. an outlier.

Summary measure analysis


The changes in weight can be seen more clearly once the values have been standardized. In the following graph the weights of the rats have been standardized by subtracting the relevant occasion mean from each original observation and then dividing by the corresponding occasion standard deviation.

#Standardising the variable Weight
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()

#Plotting again with the standardized weight
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID, color = ID)) +
  geom_point(size = 0.5) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "Standardized weight")


We can still move on from the standardized weight profiles. Individual profiles are usually not the main focus, and their interpretation is not that meaningful at this point. Instead, we will produce a graph showing the average profile of each group at each time point.

Thus, let us do that:

#Number of weighting times
n <- RATSL$Time %>% unique() %>% length()

#Summary data with mean and standard error of Weight by Group and Time
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
#Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, ~
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2~
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30~
#Plotting the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.95,0.6)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")


Above we have all three group averages in the same graph (see labels for the symbols on the right).

There is no overlap between the groups. Group 1 has a notably lighter weight profile which increases only slightly in time. The weight profile of group 3 is a little heavier than that of group 2. Overall, the weight profiles of groups 2 and 3 show a similar increase in time. The standard error is greatest in group 2, meaning there is the most variance and uncertainty in its results (longest error bars). The error in group 1 seems to be the smallest.

Summary measure analysis - Boxplot


#Creating a summary data by Group and ID
RATSSbp <- RATSL %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
#Drawing a boxplot of the mean versus group
ggplot(RATSSbp, aes(x = Group, y = mean, col=Group)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "Mean weight")


Above we have a boxplot of the mean summary measures for the three groups. The outlier in every group (discussed earlier) can now be seen as a dot (the maximum or minimum observation of the group) below the boxes in the case of groups 1 and 3, and above the box in the case of group 2.

The most extreme outlier is in group 2 and it has quite a large impact on the group's standard error. Let us next remove the outlier by filtering out mean weights exceeding 550 grams.

#Filtering mean weight values exceeding 550 g
RATSSbpf <- RATSSbp %>%
  filter(mean < 550)
glimpse(RATSSbpf)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.~
#Drawing a boxplot of the mean versus group using the filtered data
ggplot(RATSSbpf, aes(x = Group, y = mean, col=Group)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "Mean weight")


Note how the boxplots of all three groups are now fairly similar in size. Removing the outlier makes it easier to see how each diet works for the majority.

Independent samples t-test on the mean summary measure


Let us next perform a formal test for differences, i.e. pairwise t-tests, in order to assess the differences between the three groups and to calculate confidence intervals for these differences. We shall use the averaged data without the outlier.

library(rstatix)
## Warning: package 'rstatix' was built under R version 4.1.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.2
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:cowplot':
## 
##     get_legend
#Summary statistics of the data
RATSSbpf %>%
  group_by(Group) %>%
  get_summary_stats(mean, type = "mean_sd")
## # A tibble: 3 x 5
##   Group variable     n  mean    sd
##   <fct> <chr>    <dbl> <dbl> <dbl>
## 1 1     mean         8  264. 11.9 
## 2 2     mean         3  449.  7.55
## 3 3     mean         4  526. 22.1
#Completing a pairwise t-tests
pwt <- RATSSbpf %>%
  pairwise_t_test(mean ~ Group, p.adjust.method = "bonferroni")

pwt
## # A tibble: 3 x 9
##   .y.   group1 group2    n1    n2        p p.signif    p.adj p.adj.signif
## * <chr> <chr>  <chr>  <int> <int>    <dbl> <chr>       <dbl> <chr>       
## 1 mean  1      2          8     3 2.9 e-10 ****     8.71e-10 ****        
## 2 mean  1      3          8     4 1.56e-12 ****     4.68e-12 ****        
## 3 mean  2      3          3     4 1.79e- 5 ****     5.37e- 5 ****


The outcome is a 3×9 table with the three pairwise comparisons as rows; the significance can be read from the p-value columns. There is highly significant evidence that all three groups differ from each other.
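The confidence intervals mentioned above are not printed by pairwise_t_test, but as a rough sketch they can be computed by hand from the group summary statistics printed earlier (means, standard deviations and group sizes, so the numbers are approximate). For example, a Welch-type 95% confidence interval for the group 1 vs group 2 difference:

```r
# Approximate Welch 95% CI for the group 2 - group 1 difference,
# using the rounded summary statistics printed above
m1 <- 264; s1 <- 11.9; n1 <- 8
m2 <- 449; s2 <- 7.55; n2 <- 3

diff <- m2 - m1
se   <- sqrt(s1^2 / n1 + s2^2 / n2)        # standard error of the difference
df   <- (s1^2 / n1 + s2^2 / n2)^2 /
        ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))  # Welch-Satterthwaite df
ci   <- diff + c(-1, 1) * qt(0.975, df) * se
round(ci, 1)   # roughly from 170 to 200 grams
```

The interval lies far from zero, in line with the very small p-value in the table.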

Analysis of covariance with baseline


Let us next fit a linear model with the mean as the response and the week 1 weight as a baseline covariate. From that we can compute the analysis of variance table for the fitted model with anova().

RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep  ="\t", header = T)

#Adding the baseline
RATSL_B <- RATSSbp %>%
  mutate(baseline = RATS$WD1)

#Fitting the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSL_B)

#Computing the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The analysis of variance table shows that the baseline weight is strongly associated with the mean weight. After conditioning on the baseline, however, the evidence for treatment (group) differences is only weak (p ≈ 0.076). The groups do differ in weight, as the first plots and the pairwise t-tests showed, but much of that difference is explained by the rats' starting weights rather than the diets.

BPRS - dataset


Now we shall move on to the next dataset called ‘BPRS’. In this dataset 40 males were randomly assigned to one of two treatment groups and each person was rated on the Brief Psychiatric Rating Scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from 1 (not present) to 7 (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

BPRSL = read.table(file="BPRSL.txt", sep=",", header=TRUE)

#Factor treatment & subject of BPRSL
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)

#Let us see the dimensions and structure of the dataset
dim(BPRSL)
## [1] 360   5
str(BPRSL)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...


There are 360 observations of 5 variables. First there is the 'treatment' of the person, which indicates which one of the two treatments has been used on the 'subject' (the person's ID-like number). The variable 'week' has been constructed out of 'weeks' and tells which week of the study is going on. The variable 'bprs' gives the Brief Psychiatric Rating Scale score of the person.

Exploring BPRS by plotting


Let us plot the BPRS against the time (weeks).

#Plotting the BPRS data
ggplot(BPRSL, aes(x = week, y = bprs, col = subject)) +
  geom_point(size = 0.5) + 
  geom_line() + scale_x_continuous(name = "Week", breaks = seq(0, 8, 1)) + scale_y_continuous(name = "BPRS") + theme(legend.position = "top")


Putting all the observations in one figure is not very informative; the result is fairly messy. Let us separate the subjects by the two treatments.

#Plotting the BPRS data once more
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject, col = subject)) +
  geom_point(size = 0.5) + 
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))


The graphs look a lot clearer now. We can already see that the BPRS score decreases for nearly all men during the eight-week period. However, there are also individual differences and variability. The overall variability between the observations seems to decrease in time. There is not much crossing over: the men whose BPRS is higher at the beginning tend to remain higher throughout the study.

Let us next plot the mean response profiles of the two treatment groups, together with the standard errors of the means.

#Number of subject
n <- BPRSL$subject %>% unique() %>% length()

#Summary data with mean and standard error of bprs by treatment and week 
BPRSS <- BPRSL %>%
  group_by(treatment, week) %>%
  summarise(mean = mean(bprs), se = sd(bprs)/sqrt(n)) %>%
  ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the `.groups` argument.
#Plotting the mean profiles of subjects
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
  geom_line(aes(group = treatment)) +
  scale_linetype_manual(values = c(1,2)) +
  geom_point(size=4) +
  scale_shape_manual(values = c(1,2)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.2) +
  theme(legend.position = c(0.95,0.6)) +
  scale_y_continuous(name = "mean(BPRS) +/- se(BPRS)") 


Above we have both treatment group averages in the same graph (see labels for the symbols on the right).

There is now considerable overlap in the mean profiles, which suggests that there is little difference between the two treatment groups with respect to the mean BPRS values. Both treatment groups show a decrease in BPRS during the eight-week period.

Linear mixed effects models for repeated measures data


A linear mixed effects model is suitable for responses that are assumed to be approximately normally distributed after conditioning on the explanatory variables. The model formalizes the idea that an individual's pattern of responses is likely to depend on many characteristics of that individual, including some that are unobserved. These unobserved variables are included in the model as random variables, i.e. random effects. The essential feature of the model is that the (usually positive) correlation among the repeated measurements on the same individual arises from these shared, unobserved variables. (MABS4IODS, chapter 9)

Let us next fit a linear regression model ignoring the repeated measures structure of the data. BPRS will act as response variable, and week + treatment as explanatory variables.

#Creating a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)

#Printing a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16


In the results above, 'intercept' and 'week' appear strongly significant, whereas treatment 2 shows no significant difference. However, this model cannot be trusted, since the repeated measurements of the same subject are likely correlated. We will next account for this correlation with random effects.

Random intercept model


Let us continue by using a random intercept model. The random effect is assumed to be normally distributed and constant in time. Thus, the linear fit for each subject is now allowed to differ in intercept from the other subjects.

library(lme4)
## Warning: package 'lme4' was built under R version 4.1.2
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
#Creating a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

#Printing the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000


From the t-values above we can again see stronger evidence of an effect for 'intercept' and 'week', whereas treatment 2 remains insignificant. Comparing the standard error of 'week' with the previous model, we observe that it has decreased slightly. This comes from taking the within-subject correlation into account. In addition, the standard error of 'treatment 2' has decreased.
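The variance components in the summary above also let us quantify how strong the within-subject correlation is. As a rough check (plugging in the variance estimates copied from the summary printed above), the intraclass correlation is the share of the total variance attributed to subjects:

```r
# Intraclass correlation from the random intercept model's variance
# components (values copied from the summary above)
var_subject  <- 47.41    # between-subject (random intercept) variance
var_residual <- 104.21   # residual variance
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)   # roughly 0.31: about a third of the variance is between subjects
```

An intraclass correlation this far from zero supports modeling the repeated measures instead of the plain linear regression above.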

Random intercept and random slope model


We may still improve our model. The random intercept model does not always represent the observed pattern of variance and correlation between measurements in longitudinal data. Therefore, we will next fit a random intercept and random slope model. This allows the linear regression fits for each individual to differ not only in intercept but also in slope, making it possible to account for individual differences in the BPRS profiles as well as the effect of time. The two random effects are assumed to have a bivariate normal distribution.

#Creating a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

#Printing a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
#Performing an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


From the table resulting from the ANOVA likelihood ratio test we can see that the p-value of the chi-squared statistic is fairly small (0.02636, significant at the 5% level). This implies that the new model (BPRS_ref1) provides a better fit than the comparison model (BPRS_ref).
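The chi-squared statistic and its p-value can also be verified by hand from the anova() output above: the test statistic is the difference in deviances, and the degrees of freedom equal the difference in parameter counts (here using the rounded deviances printed in the table, so the result is approximate):

```r
# Likelihood ratio test by hand, using the deviances from the anova() output
chisq <- 2738.7 - 2731.4                   # deviance difference between the models
df    <- 7 - 5                             # difference in number of parameters
p     <- pchisq(chisq, df = df, lower.tail = FALSE)
round(p, 3)   # about 0.026, matching the anova() table
```

The same calculation applies to any pair of nested models fitted by maximum likelihood.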

Let us still improve the model with one more detail.

Random intercept and slope model allowing a treatment × week interaction


The final fit will be done with a random intercept and random slope model that allows for a treatment × week (time) interaction.

#Creating a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)

#Printing a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0513 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0049  8.0626        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4702  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
#Performing an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


From the likelihood ratio test of the interaction model against the previous one we can see that adding the treatment × week interaction does not significantly improve the fit: the chi-squared p-value is 0.07495, which is above the 5% significance level. Thus, the random intercept and random slope model without the interaction works better in our case.

Nevertheless, the fitted values below are drawn from the interaction model (BPRS_ref2); since the interaction term is small, the profiles from BPRS_ref1 should look similar.

Fitted BPRS profiles from the interaction model and observed BPRS profiles


Next we will compute the fitted BPRS profiles from the interaction model (BPRS_ref2) and plot them next to the observed values for both treatment groups.

#Drawing the plot of BPRS with the observed bprs values
g_obs <- ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject, col = subject)) +
  geom_point(size = 0.5) + 
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

#Creating a vector of the fitted values
Fitted <- fitted(BPRS_ref2)

#Adding the fitted values as a new column of BPRSL
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)


#Drawing the plot of BPRSL with the Fitted values of bprs
g_fit <- ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject, col = subject)) +
  geom_point(size = 0.5) + 
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(limits = c(min(BPRSL$Fitted), max(BPRSL$Fitted)))

g_obs; g_fit


The model seems to fit the observed data quite well. A linear model may not be the best option for this data, but of the models tried above this one works best. We can say that both groups show a clear decrease in BPRS values in time, and that there seem to be no notable differences between the two groups: both treatments worked equally well.



Summary of my learning


Phew, that was the last week! It demanded quite a lot of work from me. It was odd at first to apply the methods to the swapped datasets, and some challenges were faced, but I think I managed quite okay. I feel relieved that I got everything done, and even though this week's tasks were quite heavy I actually enjoyed them (after I got the hang of it).


Thank you for the course


Enjoy your holidays, you’ve earned it!


ctree <- function(N=10){
  for (i in 1:N) cat(rep("",N-i+1),rep("*",i),"\n")
  cat(rep("",N),"*\n")
}

ctree()
##           * 
##          * * 
##         * * * 
##        * * * * 
##       * * * * * 
##      * * * * * * 
##     * * * * * * * 
##    * * * * * * * * 
##   * * * * * * * * * 
##  * * * * * * * * * * 
##           *

Regards, Janina